Paralinguistic information estimation model learning apparatus, paralinguistic information estimation apparatus, and program

ABSTRACT

Paralinguistic information is estimated with high accuracy even when an utterance for which it is difficult to identify paralinguistic information is used for model learning. An acoustic feature extraction unit 11 extracts an acoustic feature from an utterance. An anti-teacher decision unit 12 decides, based on a paralinguistic information label indicating a determination result of paralinguistic information given by a plurality of listeners for each utterance, an anti-teacher label indicating an anti-teacher serving as incorrect paralinguistic information for the utterance. An anti-teacher estimation model learning unit 13 learns, based on an acoustic feature extracted from the utterance and the anti-teacher label, an anti-teacher estimation model for outputting a posterior probability of anti-teacher for an input acoustic feature.

TECHNICAL FIELD

The present invention relates to a technique for estimating paralinguistic information from speech.

BACKGROUND ART

There is a need for a technique for estimating paralinguistic information (e.g., emotions such as joy, sadness, anger, or calm) from speech. Paralinguistic information is applicable to dialogue control that takes into account the emotions of the other party in voice dialogue (e.g., changing the topic if the other party is angry), and to mental health diagnosis using speech (e.g., recording daily speech to predict mental health conditions from the frequency of sad and angry voices).

As a conventional technique, NPL 1 discloses a paralinguistic information estimation technique based on machine learning. In NPL 1, as illustrated in FIG. 1, paralinguistic information of a speaker is estimated from input time-series information of acoustic features (e.g., voice pitch) extracted from speech for each short time frame. The estimation model used there is based on deep learning in which a recurrent neural network (RNN) and a function called an attention mechanism are combined, so that the paralinguistic information can be estimated based on partial characteristics of speech (e.g., a sharply reduced volume of voice at the end of speech can be estimated to indicate sadness). In recent years, paralinguistic information estimation models based on deep learning as in NPL 1 have become mainstream.

Note that in the conventional technique, specific paralinguistic information is determined to be the correct paralinguistic information only when a plurality of listeners listen to a certain speech and the majority of the listeners feel that specific paralinguistic information for the speech. In the conventional technique, learning is performed so as to estimate the correct paralinguistic information.

CITATION LIST

Non Patent Literature

-   [NPL 1] S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention”, in Proc. of ICASSP, 2017, pp. 2227-2231.

SUMMARY OF THE INVENTION

Technical Problem

However, even if the conventional technique is used, the accuracy of paralinguistic information estimation may be insufficient. This is because the conventional technique learns the estimation model so as to identify the only correct paralinguistic information in speech, and identifying the correct paralinguistic information is a difficult task even for humans. For example, in emotion estimation (e.g., a problem of estimating one of joy, sadness, anger, and calm), which is a type of paralinguistic information estimation, the conventional technique uses a pair of a certain speech and the correct emotion as learning data to learn the emotion estimation model. In reality, however, there are many utterances for which it is difficult to identify the correct emotion. For example, with three listeners, there may be an utterance for which two listeners judge “joy” and one listener judges “calm” (in this case, the correct emotion is “joy” in the conventional technique). It is difficult to learn the characteristics inherent in the correct emotion (i.e., “joy”) from such utterances. As a result, it becomes difficult to correctly learn the estimation model, and the accuracy of paralinguistic information estimation may decrease.

In view of the technical problems described above, an object of the present invention is to estimate paralinguistic information with high accuracy even when an utterance for which it is difficult to identify paralinguistic information is used for model learning.

Means for Solving the Problem

A paralinguistic information estimation model learning device according to a first aspect of the present invention includes an anti-teacher decision unit that decides, based on a paralinguistic information label indicating a determination result of paralinguistic information given by a plurality of listeners for each utterance, an anti-teacher label indicating an anti-teacher serving as incorrect paralinguistic information for the utterance; and an anti-teacher estimation model learning unit that learns, based on an acoustic feature extracted from the utterance and the anti-teacher label, an anti-teacher estimation model for outputting a posterior probability of anti-teacher for an input acoustic feature.

A paralinguistic information estimation model learning device according to a second aspect of the present invention includes an anti-teacher decision unit that decides, based on a paralinguistic information label indicating a determination result of paralinguistic information given by a plurality of listeners for each utterance, an anti-teacher label indicating an anti-teacher serving as incorrect paralinguistic information for the utterance; a traditional teacher decision unit that decides, based on the paralinguistic information label, a traditional teacher label indicating a traditional teacher serving as correct paralinguistic information for the utterance; and a multi-task estimation model learning unit that performs multi-task learning based on an acoustic feature extracted from the utterance, the anti-teacher label, and the traditional teacher label, and learns a multi-task estimation model for outputting a posterior probability of anti-teacher and a posterior probability of traditional teacher for an input acoustic feature.

A paralinguistic information estimation device according to a third aspect of the present invention includes an anti-teacher estimation model storage unit that stores the anti-teacher estimation model learned by the paralinguistic information estimation model learning device according to the first aspect; and a paralinguistic information estimation unit that estimates, based on a posterior probability of anti-teacher obtained by inputting an acoustic feature extracted from an input utterance into the anti-teacher estimation model, paralinguistic information of the input utterance.

A paralinguistic information estimation device according to a fourth aspect of the present invention includes a multi-task estimation model storage unit that stores the multi-task estimation model learned by the paralinguistic information estimation model learning device according to the second aspect; and a paralinguistic information estimation unit that estimates, based on the posterior probability of traditional teacher obtained by inputting an acoustic feature extracted from an input utterance into the multi-task estimation model, paralinguistic information of the input utterance.

Effects of the Invention

According to the present invention, it is possible to estimate paralinguistic information with high accuracy even when an utterance for which it is difficult to identify paralinguistic information is used for model learning.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a diagram for explaining a conventional paralinguistic information estimation model.

FIG. 2 is a diagram illustrating a functional configuration of a paralinguistic information estimation model learning device according to a first embodiment.

FIG. 3 is a diagram illustrating a processing procedure of a paralinguistic information estimation model learning method according to the first embodiment.

FIG. 4 is a diagram illustrating a functional configuration of a paralinguistic information estimation device according to the first embodiment.

FIG. 5 is a diagram illustrating a processing procedure of a paralinguistic information estimation method according to the first embodiment.

FIG. 6 is a diagram illustrating a functional configuration of a paralinguistic information estimation model learning device according to a second embodiment.

FIG. 7 is a diagram illustrating a processing procedure of a paralinguistic information estimation model learning method according to the second embodiment.

FIG. 8 is a diagram illustrating a functional configuration of a paralinguistic information estimation device according to the second embodiment.

FIG. 9 is a diagram illustrating a processing procedure of a paralinguistic information estimation method according to the second embodiment.

FIG. 10 is a diagram illustrating a functional configuration of a paralinguistic information estimation model learning device according to a third embodiment.

FIG. 11 is a diagram illustrating a processing procedure of a paralinguistic information estimation model learning method according to the third embodiment.

FIG. 12 is a diagram illustrating a functional configuration of a paralinguistic information estimation device according to the third embodiment.

FIG. 13 is a diagram illustrating a processing procedure of a paralinguistic information estimation method according to the third embodiment.

FIG. 14 is a diagram for explaining a paralinguistic information estimation model according to the third embodiment.

DESCRIPTION OF EMBODIMENTS

The symbol “{circumflex over ( )}” used in the following description should properly be set immediately above the character that follows it, but due to the limitations of text notation, it is set immediately before that character. In the formulas, the symbol is set in its original position, that is, immediately above the character. For example, “{circumflex over ( )}c” is expressed by the following expression in equations.

ĉ  [Formula 1]

Hereinafter, embodiments of the present invention will be described. Note that, in the drawings, components having the same function are denoted by the same reference numerals, and duplicate description will be omitted.

[Points of Invention]

A point of the present invention is the intentional estimation of “paralinguistic information that is absolutely incorrect”, thereby contributing to the identification of the correct paralinguistic information. While it is difficult for humans to identify the only correct paralinguistic information, it is generally easy to estimate absolutely-incorrect paralinguistic information. For example, when a human listens to a speech, it may be difficult to identify the speech as expressing joy or calm, but it is often possible to judge that such a speech does not express “anger” or “sadness”. From this, it may be easier to estimate absolutely-incorrect paralinguistic information than to identify the correct paralinguistic information, and it is expected that the incorrect paralinguistic information can be estimated with high accuracy. Further, knowing the absolutely-incorrect paralinguistic information can contribute to the identification of the only correct paralinguistic information through a framework such as a process of elimination. Hereinafter, the only correct paralinguistic information will be referred to as “traditional teacher”, and the absolutely-incorrect paralinguistic information will be referred to as “anti-teacher”.

In order to realize the above point of the invention, the embodiments described below are configured as follows.

1. In the embodiments described below, an anti-teacher is decided based on a result of determining paralinguistic information by a plurality of listeners. In the present invention, the anti-teacher refers to a piece of paralinguistic information, among the pieces of paralinguistic information to be estimated, that was chosen by a certain number of listeners or fewer (e.g., 10% or less). For example, for four classes of emotion estimation (joy, sadness, anger, and calm), if three listeners judge a certain speech as expressing “joy”, “joy”, and “calm”, respectively, the anti-teacher of that speech consists of two classes: “sadness” and “anger”.

2. The embodiments described below learn an estimation model for the anti-teacher. This estimation model has the same input features and estimation model structure as the conventional technique, but the final estimation stage implements a model having a multi-label classification structure (one speech can be classified into multiple classes at the same time).

3. The embodiments described below estimate paralinguistic information by using the estimation model for anti-teacher alone or both the estimation model for anti-teacher and the estimation model for traditional teacher. In the case where the estimation model for anti-teacher is used alone, the embodiments described below perform anti-teacher estimation using the estimation model for anti-teacher, and determine the class with the smallest output probability (i.e., the class with the smallest probability of being absolutely-incorrect paralinguistic information) to be the correct paralinguistic information estimation result. In the case where both the estimation model for anti-teacher and the estimation model for traditional teacher are used, the embodiments described below determine the class with the largest value obtained by subtracting the output probability of the estimation model for anti-teacher from the output probability of the estimation model for traditional teacher to be the correct paralinguistic information estimation result. Here, the value obtained by this subtraction is the probability of the correct paralinguistic information minus the probability of the incorrect paralinguistic information.

First Embodiment

In a first embodiment, paralinguistic information is estimated by using the estimation model for anti-teacher alone.

<Paralinguistic Information Estimation Model Learning Device 1>

A paralinguistic information estimation model learning device according to the first embodiment learns an anti-teacher estimation model by using learning data that includes a plurality of utterances and paralinguistic information labels each indicating a determination result of paralinguistic information given by a plurality of listeners for an utterance. As illustrated in FIG. 2, a paralinguistic information estimation model learning device 1 according to the first embodiment includes an acoustic feature extraction unit 11, an anti-teacher decision unit 12, an anti-teacher estimation model learning unit 13, and an anti-teacher estimation model storage unit 20. This paralinguistic information estimation model learning device 1 implements a paralinguistic information estimation model learning method according to the first embodiment by performing the steps of processing illustrated by way of example in FIG. 3.

The paralinguistic information estimation model learning device 1 is, for example, a special device configured by loading a special program onto a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. The paralinguistic information estimation model learning device 1 executes each step of processing under the control of the central processing unit, for example. Data input to the paralinguistic information estimation model learning device 1 and data obtained by each step of processing are stored in the main storage device, for example, and the data stored in the main storage device is read to the central processing unit as needed so that it is used for other steps of processing. At least a part of each processing unit of the paralinguistic information estimation model learning device 1 may be composed of hardware such as an integrated circuit. Each storage unit included in the paralinguistic information estimation model learning device 1 can be composed of, for example, a main storage device such as RAM (Random Access Memory), an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as a flash memory, or middleware such as a relational database or a key-value store.

In step S11, the acoustic feature extraction unit 11 extracts a prosodic feature from an utterance in the learning data. The prosodic feature is a vector including one or more features among fundamental frequency, short-time power, Mel-frequency Cepstral Coefficients (MFCC), zero-crossing rate, Harmonics-to-Noise Ratio (HNR), and Mel-filter bank output. Further, the prosodic feature may be a time series of vectors of these features (frame by frame), a vector of these features over a certain time span, or a vector of statistics of these features over the whole utterance (mean, variance, maximum, minimum, gradient, etc.). The acoustic feature extraction unit 11 outputs the extracted prosodic feature to the anti-teacher estimation model learning unit 13.

In step S12, the anti-teacher decision unit 12 decides an anti-teacher label from a paralinguistic information label of the learning data. The anti-teacher refers to a piece of paralinguistic information, among the pieces of paralinguistic information to be estimated, that was chosen by a predetermined threshold number (hereinafter referred to as “anti-teacher threshold”) of listeners or fewer (e.g., 10% or less). The anti-teacher label refers to a vector in which the paralinguistic information classes of the anti-teacher are 1 and the others are 0. In other words, unlike the traditional teacher, the anti-teacher label is not a vector in which exactly one paralinguistic information class is 1 and the others are 0, but a vector in which one or more paralinguistic information classes are 1. For example, in the four-class emotion estimation of joy, sadness, anger, and calm, if the anti-teacher threshold is set to 0.1 and three listeners judge a certain speech as “joy”, “joy”, and “calm”, respectively, the anti-teacher label for that speech is a four-dimensional vector in which the two classes “sadness” and “anger” are 1 and the two classes “joy” and “calm” are 0.

The anti-teacher label is specifically expressed as follows.

$t^{*} = \begin{bmatrix} t_{1}^{*} \\ \vdots \\ t_{K}^{*} \end{bmatrix}, \quad t_{k}^{*} = \begin{cases} 1 & \text{if } \frac{1}{N} \sum_{n=1}^{N} h_{k}^{n} \leq \beta \\ 0 & \text{otherwise} \end{cases}$  [Formula 2]

Here, h_(k)^(n) indicates whether or not the n-th listener felt the k-th paralinguistic information class (1 if the listener did, 0 otherwise). K is the total number of paralinguistic information classes. N is the total number of listeners. β is the anti-teacher threshold, which is 0 or more and 1 or less.

The anti-teacher label may be given so that only a paralinguistic information class that has not been chosen by any listener is an anti-teacher. This corresponds to the case where the anti-teacher threshold β is set to 0.

The anti-teacher decision unit 12 outputs the decided anti-teacher label to the anti-teacher estimation model learning unit 13.
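As a non-authoritative illustration of Formula 2, the following minimal sketch computes an anti-teacher label from listener votes. The function name `anti_teacher_label`, the one-hot vote layout, and the class order are assumptions made for this example and do not appear in the description above.

```python
from typing import List

# Hypothetical class order used only for this illustration.
CLASSES = ["joy", "sadness", "anger", "calm"]


def anti_teacher_label(votes: List[List[int]], beta: float) -> List[int]:
    """Return t* of Formula 2.

    votes[n][k] plays the role of h_k^n: 1 if the n-th listener felt
    the k-th class, 0 otherwise. beta is the anti-teacher threshold.
    """
    n_listeners = len(votes)
    n_classes = len(votes[0])
    label = []
    for k in range(n_classes):
        # Proportion of listeners who chose class k.
        ratio = sum(v[k] for v in votes) / n_listeners
        label.append(1 if ratio <= beta else 0)
    return label


# Three listeners judge "joy", "joy", and "calm".
example_votes = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
print(anti_teacher_label(example_votes, beta=0.1))  # [0, 1, 1, 0]
```

With the threshold 0.1, “sadness” and “anger” become anti-teachers, which matches the worked example in the text.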

In step S13, the anti-teacher estimation model learning unit 13 learns the anti-teacher estimation model based on the prosodic feature output by the acoustic feature extraction unit 11 and the anti-teacher label output by the anti-teacher decision unit 12. As the estimation model, a model is used here that can handle a multi-label classification problem (a classification problem in which one speech can be classified into multiple classes at the same time). This is because anti-teachers may appear in multiple classes for one speech. The estimation model may be a model based on deep learning as in the conventional technique, or may be a multiclass logistic regression, provided that its output can be expressed as a probability value (the probability that a certain paralinguistic information class is 1). The anti-teacher estimation model learning unit 13 stores the learned anti-teacher estimation model in the anti-teacher estimation model storage unit 20.
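The multi-label structure of step S13 can be sketched as follows. This is only one possible realization, assuming a linear model with an independent sigmoid output per class trained by gradient descent on a summed binary cross-entropy loss. The description does not fix the model or the loss, and all names, feature values, and hyperparameters below are hypothetical.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_anti_teacher(features: List[float],
                         weights: List[List[float]],
                         biases: List[float]) -> List[float]:
    """One independent probability per class (multi-label output)."""
    return [
        sigmoid(sum(w * x for w, x in zip(w_k, features)) + b_k)
        for w_k, b_k in zip(weights, biases)
    ]


def binary_cross_entropy(probs: List[float], label: List[int]) -> float:
    """Binary cross-entropy summed over classes."""
    eps = 1e-12
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(probs, label)
    )


def sgd_step(features: List[float], label: List[int],
             weights: List[List[float]], biases: List[float],
             lr: float = 0.1) -> None:
    """One gradient step; d(loss)/d(logit_k) = p_k - t_k for sigmoid + BCE."""
    probs = predict_anti_teacher(features, weights, biases)
    for k, (p, t) in enumerate(zip(probs, label)):
        grad = p - t
        weights[k] = [w - lr * grad * x for w, x in zip(weights[k], features)]
        biases[k] -= lr * grad


# Hypothetical 3-dimensional feature and the anti-teacher label [0, 1, 1, 0].
x = [0.2, -1.0, 0.5]
t_star = [0, 1, 1, 0]
W = [[0.0, 0.0, 0.0] for _ in range(4)]
b = [0.0, 0.0, 0.0, 0.0]

loss_before = binary_cross_entropy(predict_anti_teacher(x, W, b), t_star)
for _ in range(200):
    sgd_step(x, t_star, W, b)
loss_after = binary_cross_entropy(predict_anti_teacher(x, W, b), t_star)
print(loss_before, loss_after)  # the loss decreases during training
```

Because each class has its own sigmoid, several classes can receive a high anti-teacher probability at once, which a softmax output would not allow.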

<Paralinguistic Information Estimation Device 2>

A paralinguistic information estimation device according to the first embodiment estimates paralinguistic information of an input utterance by using the learned anti-teacher estimation model. As illustrated in FIG. 4, a paralinguistic information estimation device 2 according to the first embodiment includes an acoustic feature extraction unit 11, an anti-teacher estimation model storage unit 20, and a paralinguistic information estimation unit 21. This paralinguistic information estimation device 2 implements a paralinguistic information estimation method according to the first embodiment by performing the steps of processing illustrated by way of example in FIG. 5.

The paralinguistic information estimation device 2 is, for example, a special device configured by loading a special program onto a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. The paralinguistic information estimation device 2 executes each step of processing under the control of the central processing unit, for example. Data input to the paralinguistic information estimation device 2 and data obtained by each step of processing are stored in the main storage device, for example, and the data stored in the main storage device is read to the central processing unit as needed so that it is used for other steps of processing. At least a part of each processing unit of the paralinguistic information estimation device 2 may be composed of hardware such as an integrated circuit. Each storage unit included in the paralinguistic information estimation device 2 can be composed of, for example, a main storage device such as RAM (Random Access Memory), an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as a flash memory, or middleware such as a relational database or a key-value store.

The anti-teacher estimation model storage unit 20 stores the anti-teacher estimation model learned by the paralinguistic information estimation model learning device 1.

In step S11, the acoustic feature extraction unit 11 extracts a prosodic feature from the input utterance. The extraction of the prosodic feature can be performed in the same manner as in the paralinguistic information estimation model learning device 1. The acoustic feature extraction unit 11 outputs the extracted prosodic feature to the paralinguistic information estimation unit 21.

In step S21, the paralinguistic information estimation unit 21 estimates paralinguistic information from the prosodic feature output by the acoustic feature extraction unit 11 based on the anti-teacher estimation model stored in the anti-teacher estimation model storage unit 20. In the estimation, the class with the lowest output of the anti-teacher estimation model for a certain prosodic feature is regarded as the paralinguistic information estimation result. This corresponds to selecting the paralinguistic information that is least likely to be an anti-teacher, that is, the paralinguistic information that is not considered to be “absolutely-incorrect paralinguistic information”. The paralinguistic information estimation unit 21 outputs the estimation result of the paralinguistic information as an output of the paralinguistic information estimation device 2.
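The decision rule of step S21 can be sketched as below. The function name `estimate_from_anti_teacher`, the class order, and the probability values are hypothetical and serve only to illustrate selecting the class with the lowest anti-teacher probability.

```python
from typing import List


def estimate_from_anti_teacher(anti_probs: List[float],
                               classes: List[str]) -> str:
    """Pick the class least likely to be an anti-teacher (argmin)."""
    best = min(range(len(anti_probs)), key=lambda k: anti_probs[k])
    return classes[best]


emotion_classes = ["joy", "sadness", "anger", "calm"]
# Hypothetical anti-teacher posterior probabilities for one utterance.
anti_posteriors = [0.10, 0.85, 0.90, 0.30]
print(estimate_from_anti_teacher(anti_posteriors, emotion_classes))  # joy
```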

Second Embodiment

In a second embodiment, paralinguistic information is estimated using the estimation model for traditional teacher in addition to the estimation model for anti-teacher. At this time, the paralinguistic information is estimated based on a weighted difference between the output results of the two estimation models. This corresponds to performing paralinguistic information estimation in consideration of both the “probability that certain paralinguistic information is correct” and the “probability that certain paralinguistic information is incorrect”. As a result, the estimation accuracy of the paralinguistic information is improved as compared with the cases where only one of the probabilities is taken into consideration (i.e., the conventional technique and the first embodiment, respectively).

<Paralinguistic Information Estimation Model Learning Device 3>

A paralinguistic information estimation model learning device according to the second embodiment learns the anti-teacher estimation model and the traditional teacher estimation model from the same learning data as in the first embodiment. As illustrated in FIG. 6, the paralinguistic information estimation model learning device 3 according to the second embodiment further includes a traditional teacher decision unit 31, a traditional teacher estimation model learning unit 32, and a traditional teacher estimation model storage unit 40 in addition to the acoustic feature extraction unit 11, the anti-teacher decision unit 12, the anti-teacher estimation model learning unit 13, and the anti-teacher estimation model storage unit 20 of the first embodiment. This paralinguistic information estimation model learning device 3 implements a paralinguistic information estimation model learning method according to the second embodiment by performing the steps of processing illustrated by way of example in FIG. 7.

Hereinafter, the paralinguistic information estimation model learning device 3 according to the second embodiment will be described with a focus on the differences from the paralinguistic information estimation model learning device 1 according to the first embodiment.

In step S31, the traditional teacher decision unit 31 decides a traditional teacher label from a paralinguistic information label of the learning data. As in the conventional technique, if the majority of all listeners judge the same paralinguistic information for a certain speech, the traditional teacher label is a vector in which that paralinguistic information class is set to 1 and the other paralinguistic information classes are set to 0. If no majority judges the same paralinguistic information, that speech is regarded as having no correct paralinguistic information and is not used for model learning. For example, in the four-class emotion estimation of joy, sadness, anger, and calm, if three listeners judge a certain speech as “joy”, “joy”, and “calm”, respectively, the traditional teacher label for that speech is a four-dimensional vector in which the “joy” class is 1 and the remaining three classes “sadness”, “anger”, and “calm” are 0. The traditional teacher decision unit 31 outputs the decided traditional teacher label to the traditional teacher estimation model learning unit 32.

In step S32, the traditional teacher estimation model learning unit 32 learns the traditional teacher estimation model based on the prosodic feature output by the acoustic feature extraction unit 11 and the traditional teacher label output by the traditional teacher decision unit 31. As the estimation model, a model is used here that can handle a multi-class classification problem (a classification problem that classifies one speech into one class). The estimation model may be a model based on deep learning as in the conventional technique, or may be a multiclass logistic regression, provided that its output can be expressed as a probability value (the probability that a certain paralinguistic information class is 1). The traditional teacher estimation model learning unit 32 stores the learned traditional teacher estimation model in the traditional teacher estimation model storage unit 40.
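The majority-vote rule of step S31 might be sketched as follows. The function name `traditional_teacher_label` and the vote layout are assumptions for illustration, and “majority” is interpreted here as strictly more than half of the listeners, which the text does not state explicitly.

```python
from typing import List, Optional


def traditional_teacher_label(votes: List[List[int]]) -> Optional[List[int]]:
    """Return a one-hot label for the majority class, or None.

    votes[n][k] is 1 if the n-th listener chose the k-th class.
    None means no class reached a majority, so the utterance would be
    excluded from traditional teacher model learning.
    """
    n_listeners = len(votes)
    n_classes = len(votes[0])
    counts = [sum(v[k] for v in votes) for k in range(n_classes)]
    for k, count in enumerate(counts):
        # Assumption: majority means strictly more than half.
        if count > n_listeners / 2:
            return [1 if j == k else 0 for j in range(n_classes)]
    return None


# Class order assumed: joy, sadness, anger, calm.
majority_votes = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
split_votes = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
print(traditional_teacher_label(majority_votes))  # [1, 0, 0, 0]
print(traditional_teacher_label(split_votes))     # None
```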

<Paralinguistic Information Estimation Device 4>

A paralinguistic information estimation device according to the second embodiment estimates paralinguistic information of an input utterance by using both the learned anti-teacher estimation model and the learned traditional teacher estimation model. As illustrated in FIG. 8, a paralinguistic information estimation device 4 according to the second embodiment further includes a traditional teacher estimation model storage unit 40 and a paralinguistic information estimation unit 41 in addition to the acoustic feature extraction unit 11 and the anti-teacher estimation model storage unit 20 of the first embodiment. This paralinguistic information estimation device 4 implements a paralinguistic information estimation method according to the second embodiment by performing the steps of processing illustrated by way of example in FIG. 9.

Hereinafter, the paralinguistic information estimation device 4 according to the second embodiment will be described with a focus on the differences from the paralinguistic information estimation device 2 according to the first embodiment.

The traditional teacher estimation model storage unit 40 stores the traditional teacher estimation model learned by the paralinguistic information estimation model learning device 3.

In step S41, the paralinguistic information estimation unit 41 estimates paralinguistic information from the prosodic feature output by the acoustic feature extraction unit 11 based on both the anti-teacher estimation model stored in the anti-teacher estimation model storage unit 20 and the traditional teacher estimation model stored in the traditional teacher estimation model storage unit 40. In the estimation, the paralinguistic information estimation unit 41 decides an estimation result of paralinguistic information based on a weighted difference between the output of the traditional teacher estimation model and the output of the anti-teacher estimation model for a certain prosodic feature. This corresponds to performing paralinguistic information estimation in consideration of both the “probability that certain paralinguistic information is correct” and the “probability that certain paralinguistic information is incorrect”.

The estimation of paralinguistic information is specifically expressed as follows.

$\hat{c}_{k} = \underset{c_{k}}{\operatorname{argmax}} \left( (1 - \alpha)\, p(c_{k}) - \alpha\, q(c_{k}) \right)$  [Formula 3]

Here, {circumflex over ( )}c_(k) represents an estimation result of paralinguistic information; c_(k) represents the k-th paralinguistic information class; p(c_(k)) represents the probability that the k-th paralinguistic information class is correct, and is an output of the traditional teacher estimation model; q(c_(k)) represents the probability that the k-th paralinguistic information class is incorrect, and is an output of the anti-teacher estimation model; and α represents an estimation weight.

The estimation weight α is any continuous value from 0 to 1. The closer the estimation weight is to 0, the more importance is placed on the “probability that certain paralinguistic information is correct”, and the closer it is to 1, the more importance is placed on the “probability that certain paralinguistic information is incorrect”. For example, the estimation weight is set to 0.3.
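The weighted-difference rule of Formula 3 can be illustrated with the short sketch below. The function name `estimate_weighted` and the probability values are hypothetical. Only the scoring expression (1 - α)p(c_k) - αq(c_k) and the example weight 0.3 come from the description.

```python
from typing import List


def estimate_weighted(p: List[float], q: List[float],
                      classes: List[str], alpha: float = 0.3) -> str:
    """Return the class maximizing (1 - alpha) * p(c_k) - alpha * q(c_k).

    p: traditional teacher model outputs (probability of being correct).
    q: anti-teacher model outputs (probability of being incorrect).
    """
    scores = [(1 - alpha) * p_k - alpha * q_k for p_k, q_k in zip(p, q)]
    best = max(range(len(scores)), key=lambda k: scores[k])
    return classes[best]


labels = ["joy", "sadness", "anger", "calm"]
# Hypothetical model outputs for one utterance.
p_correct = [0.40, 0.05, 0.10, 0.45]
q_incorrect = [0.10, 0.90, 0.80, 0.60]

print(estimate_weighted(p_correct, q_incorrect, labels, alpha=0.3))  # joy
print(estimate_weighted(p_correct, q_incorrect, labels, alpha=0.0))  # calm
```

In this made-up case the traditional teacher output alone (α = 0) would select “calm”, while taking the anti-teacher output into account shifts the decision to “joy”.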

Modified Example of Second Embodiment

In the second embodiment, an example is described in which the anti-teacher estimation model learning unit and the traditional teacher estimation model learning unit are configured to perform completely separate processing. However, one learned estimation model may be used as the initial values for learning the other estimation model. Specifically, the anti-teacher estimation model stored in the anti-teacher estimation model storage unit 20 is input to the traditional teacher estimation model learning unit 32. Then, the traditional teacher estimation model learning unit 32 sets the anti-teacher estimation model as the initial values for the traditional teacher estimation model. Then, the traditional teacher estimation model learning unit 32 learns the traditional teacher estimation model based on the prosodic feature output by the acoustic feature extraction unit 11 and the traditional teacher label output by the traditional teacher decision unit 31. Alternatively, the traditional teacher estimation model stored in the traditional teacher estimation model storage unit 40 is input to the anti-teacher estimation model learning unit 13. Then, the anti-teacher estimation model learning unit 13 sets the traditional teacher estimation model as the initial values for the anti-teacher estimation model. Then, the anti-teacher estimation model learning unit 13 learns the anti-teacher estimation model based on the prosodic feature output by the acoustic feature extraction unit 11 and the anti-teacher label output by the anti-teacher decision unit 12. The traditional teacher estimation model and the anti-teacher estimation model learn the relevance between the prosodic feature and the traditional teacher or between the prosodic feature and the anti-teacher, respectively, and thus it is expected that the estimation criteria learned by one model are usable by the other model. Accordingly, this additional processing may further improve the accuracy of paralinguistic information estimation.
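The use of one learned model as the initial values for the other can be sketched as follows. This is an illustrative sketch only: the class `LinearSoftmax` is a hypothetical stand-in (a small linear softmax classifier) for the estimation models, and the two-dimensional features and labels are toy values, not data from the embodiments.

```python
import copy
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class LinearSoftmax:
    """Hypothetical stand-in for an estimation model (not the RNN-based model)."""

    def __init__(self, n_features, n_classes):
        self.w = [[0.0] * n_features for _ in range(n_classes)]
        self.b = [0.0] * n_classes

    def predict(self, x):
        logits = [sum(wi * xi for wi, xi in zip(row, x)) + bk
                  for row, bk in zip(self.w, self.b)]
        return softmax(logits)

    def fit(self, xs, targets, epochs=50, lr=0.5):
        # Plain stochastic gradient descent on the cross-entropy loss.
        for _ in range(epochs):
            for x, t in zip(xs, targets):
                y = self.predict(x)
                for k in range(len(self.b)):
                    g = y[k] - t[k]
                    self.b[k] -= lr * g
                    for j in range(len(x)):
                        self.w[k][j] -= lr * g * x[j]
        return self


xs = [[1.0, 0.0], [0.0, 1.0]]                 # toy features
anti_labels = [[0.0, 1.0], [1.0, 0.0]]        # class regarded as incorrect
teacher_labels = [[1.0, 0.0], [0.0, 1.0]]     # class regarded as correct

# Learn the anti-teacher model first.
anti_model = LinearSoftmax(2, 2).fit(xs, anti_labels)

# Use its parameters as the initial values of the traditional teacher model,
# then continue learning with the traditional teacher labels.
teacher_model = copy.deepcopy(anti_model)
teacher_model.fit(xs, teacher_labels)
```

The deep copy keeps the stored anti-teacher model unchanged while the traditional teacher model is learned from it. The reverse direction is obtained by swapping the roles of the two models and their labels.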

Third Embodiment

In a third embodiment, paralinguistic information is estimated using a multi-task estimation model that estimates the traditional teacher and the anti-teacher at the same time. In the model learning, the traditional teacher estimation and the anti-teacher estimation are learned simultaneously as multi-task learning. It is known that, in multi-task learning, solving different tasks with a single model allows knowledge common to the tasks to be acquired, which improves the estimation accuracy of each task (see Reference 1 below). The third embodiment estimates paralinguistic information using both the traditional teacher and the anti-teacher, as in the second embodiment, but the estimation model itself can be improved through multi-task learning, thereby improving the estimation accuracy.

-   [Reference 1] R. Caruana, “Multitask Learning”, Machine Learning, vol. 28, pp. 41-75, 1997.

<Paralinguistic Information Estimation Model Learning Device 5>

A paralinguistic information estimation model learning device according to the third embodiment learns the multi-task estimation model from the same learning data as in the first embodiment. As illustrated in FIG. 10, a paralinguistic information estimation model learning device 5 according to the third embodiment further includes a multi-task estimation model learning unit 51 and a multi-task estimation model storage unit 60 in addition to the acoustic feature extraction unit 11 and the anti-teacher decision unit 12 of the first embodiment, and the traditional teacher decision unit 31 of the second embodiment. This paralinguistic information estimation model learning device 5 implements a paralinguistic information estimation model learning method according to the third embodiment by performing the processing steps illustrated by way of example in FIG. 11.

Hereinafter, the paralinguistic information estimation model learning device 5 according to the third embodiment will be described with a focus on the differences from the paralinguistic information estimation model learning device 1 according to the first embodiment and the paralinguistic information estimation model learning device 3 according to the second embodiment.

In step S51, the multi-task estimation model learning unit 51 performs multi-task learning using the prosodic feature output by the acoustic feature extraction unit 11, the anti-teacher label output by the anti-teacher decision unit 12, and the traditional teacher label output by the traditional teacher decision unit 31, and thereby learns the multi-task estimation model. Since an estimation model based on a neural network is generally used in multi-task learning, the estimation model in the present embodiment is also based on a neural network. For example, as illustrated in FIG. 10, the estimation model is one in which a branch structure for estimating the anti-teacher is added to the estimation model that is based on the deep learning of the conventional technique. The multi-task estimation model learning unit 51 stores the learned multi-task estimation model in the multi-task estimation model storage unit 60.
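The branch structure and the joint learning in step S51 can be sketched as follows. This is a minimal sketch under stated assumptions, not the model of FIG. 10: the class `MultiTaskNet`, the single shared tanh layer (in place of the RNN with an attention mechanism), the softmax output on both heads, the summed cross-entropy loss with the weight `anti_weight`, and the toy data are all hypothetical.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def cross_entropy(y, t):
    return -sum(tk * math.log(max(yk, 1e-12)) for yk, tk in zip(y, t))


class MultiTaskNet:
    """Shared layer with a traditional teacher head and an anti-teacher branch."""

    def __init__(self, n_in, n_hidden, n_classes, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.shared = init(n_hidden, n_in)
        self.head_teacher = init(n_classes, n_hidden)
        self.head_anti = init(n_classes, n_hidden)

    def forward(self, x):
        h = [math.tanh(v) for v in matvec(self.shared, x)]
        p = softmax(matvec(self.head_teacher, h))  # traditional teacher output
        q = softmax(matvec(self.head_anti, h))     # anti-teacher output
        return h, p, q

    def loss(self, x, t_teacher, t_anti, anti_weight=1.0):
        _, p, q = self.forward(x)
        return cross_entropy(p, t_teacher) + anti_weight * cross_entropy(q, t_anti)

    def train_step(self, x, t_teacher, t_anti, lr=0.2, anti_weight=1.0):
        h, p, q = self.forward(x)
        d_p = [pk - tk for pk, tk in zip(p, t_teacher)]
        d_q = [anti_weight * (qk - tk) for qk, tk in zip(q, t_anti)]
        # Both tasks send gradients into the shared layer.
        d_h = [sum(self.head_teacher[k][j] * d_p[k] + self.head_anti[k][j] * d_q[k]
                   for k in range(len(p)))
               for j in range(len(h))]
        for k in range(len(p)):
            for j in range(len(h)):
                self.head_teacher[k][j] -= lr * d_p[k] * h[j]
                self.head_anti[k][j] -= lr * d_q[k] * h[j]
        for j in range(len(h)):
            g = d_h[j] * (1.0 - h[j] ** 2)
            for i in range(len(x)):
                self.shared[j][i] -= lr * g * x[i]


# Toy data: (feature, traditional teacher label, anti-teacher label).
# The last feature is a constant 1.0 acting as a bias input.
data = [
    ([1.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]),
    ([0.0, 1.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]),
]

net = MultiTaskNet(n_in=3, n_hidden=4, n_classes=3)
loss_before = sum(net.loss(*sample) for sample in data)
for _ in range(500):
    for x, t_teacher, t_anti in data:
        net.train_step(x, t_teacher, t_anti)
loss_after = sum(net.loss(*sample) for sample in data)
```

Because the gradient `d_h` combines contributions from both heads, the shared layer is updated by the traditional teacher task and the anti-teacher task together, which is the property that multi-task learning relies on.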

<Paralinguistic Information Estimation Device 6>

A paralinguistic information estimation device according to the third embodiment estimates paralinguistic information of an input utterance by using the learned multi-task estimation model. As illustrated in FIG. 12, a paralinguistic information estimation device 6 according to the third embodiment further includes a multi-task estimation model storage unit 60 and a paralinguistic information estimation unit 61 in addition to the acoustic feature extraction unit 11 of the first embodiment. This paralinguistic information estimation device 6 implements a paralinguistic information estimation method according to the third embodiment by performing the processing steps illustrated by way of example in FIG. 13.

Hereinafter, the paralinguistic information estimation device 6 according to the third embodiment will be described with a focus on the differences from the paralinguistic information estimation device 2 according to the first embodiment and the paralinguistic information estimation device 4 according to the second embodiment.

The multi-task estimation model storage unit 60 stores the multi-task estimation model learned by the paralinguistic information estimation model learning device 5.

In step S61, the paralinguistic information estimation unit 61 estimates paralinguistic information from the prosodic features output by the acoustic feature extraction unit 11 based on the multi-task estimation model stored in the multi-task estimation model storage unit 60. In the estimation, the class with the highest traditional teacher estimation output for a given prosodic feature is regarded as the paralinguistic information estimation result. Since multi-task learning is used in the learning of the estimation model, it is possible to perform paralinguistic information estimation in consideration of the influence of the anti-teacher (i.e., the traditional teacher is estimated without mistakenly selecting the anti-teacher), and thus to improve paralinguistic information estimation accuracy.
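The selection rule in step S61 can be sketched as follows. The function `estimate_paralinguistic` and the stand-in `dummy_forward`, which returns fixed illustrative outputs in place of a learned multi-task estimation model, are hypothetical names introduced only for this sketch.

```python
def estimate_paralinguistic(forward, features, classes):
    """Return the class with the highest traditional teacher output.

    forward: callable returning (traditional teacher outputs, anti-teacher outputs).
    The anti-teacher outputs are not used at estimation time in this sketch;
    the anti-teacher influences the result only through the learning.
    """
    p_teacher, _q_anti = forward(features)
    best = max(range(len(p_teacher)), key=lambda k: p_teacher[k])
    return classes[best]


def dummy_forward(features):
    # Hypothetical fixed outputs standing in for a learned multi-task model.
    return [0.1, 0.6, 0.2, 0.1], [0.5, 0.05, 0.3, 0.15]


classes = ["joy", "sadness", "anger", "calm"]
result = estimate_paralinguistic(dummy_forward, [0.0], classes)
print(result)
```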

MODIFIED EXAMPLES

In the above-described embodiments, an example is described in which the paralinguistic information estimation model learning device and the paralinguistic information estimation device are configured as separate devices. Alternatively, in the embodiments of the present invention, a single paralinguistic information estimation device may be configured to have both a function of learning a paralinguistic information estimation model and a function of estimating paralinguistic information using the learned paralinguistic information estimation model. Specifically, a paralinguistic information estimation device according to a modified example of the first embodiment includes the acoustic feature extraction unit 11, the anti-teacher decision unit 12, the anti-teacher estimation model learning unit 13, the anti-teacher estimation model storage unit 20, and the paralinguistic information estimation unit 21. Further, a paralinguistic information estimation device according to a modified example of the second embodiment further includes the traditional teacher decision unit 31, the traditional teacher estimation model learning unit 32, the traditional teacher estimation model storage unit 40, and the paralinguistic information estimation unit 41 in addition to the acoustic feature extraction unit 11, the anti-teacher decision unit 12, the anti-teacher estimation model learning unit 13, and the anti-teacher estimation model storage unit 20. Furthermore, a paralinguistic information estimation device according to a modified example of the third embodiment further includes the multi-task estimation model learning unit 51, the multi-task estimation model storage unit 60, and the paralinguistic information estimation unit 61 in addition to the acoustic feature extraction unit 11, the anti-teacher decision unit 12, and the traditional teacher decision unit 31.

Although the embodiments of the present invention have been described above, needless to say, the specific configuration is not limited to these embodiments, and any design change or the like made as appropriate without departing from the spirit and scope of the present invention is included in the present invention. The various types of processing described in the embodiments may be executed not only in chronological order according to the description, but also in parallel or individually as required or depending on the processing capacity of the device that executes the processing.

[Program and Recording Medium]

When various processing functions in each device described in the above embodiments are implemented by a computer, the processing contents of the functions to be included in each device are described by a program. Then, by executing this program on a computer, various processing functions of the above-described devices are implemented on the computer.

The program(s) describing the processing contents can be recorded in a computer-readable recording medium. The computer-readable recording medium may be anything, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

Further, this program is distributed by, for example, selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM in which the program is recorded. Furthermore, the program may be stored in a storage device of a server computer so that the program can be distributed by being transferred from the server computer to another computer via a network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from the server computer in its own storage device. Then, when processing is executed, the computer reads the program stored in its own storage device and executes the processing according to the read program. Further, as another execution form of this program, a computer may directly read the program from a portable recording medium and execute processing according to the program. Furthermore, each time the program is transferred from a server computer to this computer, the computer may sequentially execute processing according to the received program. In addition, a configuration may be provided in which a so-called ASP (Application Service Provider) service, which implements the processing functions only by an instruction of execution and acquisition of the result without transferring the program from the server computer to this computer, executes the above-described processing. Note that the program in the present embodiment includes information that is used for processing by a computer and is equivalent to the program (e.g., data that is not a direct command to the computer but has a property of defining processing on the computer).

Further, although the device is implemented by executing a predetermined program on a computer in this embodiment, at least a part of these processing contents may be realized by hardware.

REFERENCE SIGNS LIST

-   1, 3, 5 Paralinguistic information estimation model learning device
-   11 Acoustic feature extraction unit
-   12 Anti-teacher decision unit
-   13 Anti-teacher estimation model learning unit
-   20 Anti-teacher estimation model storage unit
-   31 Traditional teacher decision unit
-   32 Traditional teacher estimation model learning unit
-   40 Traditional teacher estimation model storage unit
-   51 Multi-task estimation model learning unit
-   60 Multi-task estimation model storage unit
-   2, 4, 6 Paralinguistic information estimation device
-   21, 41, 61 Paralinguistic information estimation unit

1. A paralinguistic information estimation model learning device comprising: an anti-teacher determiner configured to determine, based on a paralinguistic information label indicating a determination result of paralinguistic information given by a plurality of listeners for each utterance, an anti-teacher label indicating an anti-teacher serving as incorrect paralinguistic information for the utterance; and an anti-teacher estimation model learner configured to learn, based on an acoustic feature extracted from the utterance and the anti-teacher label, an anti-teacher estimation model for outputting a posterior probability of anti-teacher for an input acoustic feature.
 2. The paralinguistic information estimation model learning device according to claim 1, further comprising: a traditional teacher determiner configured to determine, based on the paralinguistic information label, a traditional teacher label indicating a traditional teacher serving as correct paralinguistic information for the utterance; and a traditional teacher estimation model learner configured to learn, based on an acoustic feature extracted from the utterance and the traditional teacher label, a traditional teacher estimation model for outputting a posterior probability of traditional teacher for an input acoustic feature.
 3. A paralinguistic information estimation model learning device comprising: an anti-teacher determiner configured to determine, based on a paralinguistic information label indicating a determination result of paralinguistic information given by a plurality of listeners for each utterance, an anti-teacher label indicating an anti-teacher serving as incorrect paralinguistic information for the utterance; a traditional teacher determiner configured to determine, based on the paralinguistic information label, a traditional teacher label indicating a traditional teacher serving as correct paralinguistic information for the utterance; and a multi-task estimation model learner configured to perform multi-task learning based on an acoustic feature extracted from the utterance, the anti-teacher label, and the traditional teacher label, and to learn a multi-task estimation model for outputting a posterior probability of anti-teacher and a posterior probability of traditional teacher for an input acoustic feature.
 4. The paralinguistic information estimation model learning device according to claim 3, wherein the multi-task estimation model is a model in which a branch structure for outputting the posterior probability of anti-teacher is added to a paralinguistic information estimation model that is based on deep learning for outputting the posterior probability of traditional teacher.
 5. A paralinguistic information estimation device comprising: an anti-teacher estimation model store configured to store an anti-teacher estimation model learned by a paralinguistic information estimation model learning device, wherein the paralinguistic information estimation model learning device comprises: an anti-teacher determiner configured to determine, based on a paralinguistic information label indicating a determination result of paralinguistic information given by a plurality of listeners for each utterance, an anti-teacher label indicating an anti-teacher serving as incorrect paralinguistic information for the utterance; and an anti-teacher estimation model learner configured to learn, based on an acoustic feature extracted from the utterance and the anti-teacher label, an anti-teacher estimation model for outputting a posterior probability of anti-teacher for an input acoustic feature; a paralinguistic information estimator configured to estimate, based on a posterior probability of anti-teacher obtained by inputting an acoustic feature extracted from an input utterance into the anti-teacher estimation model, paralinguistic information of the input utterance.
 6. The paralinguistic information estimation device according to claim 5, further comprising: a traditional teacher estimation model store configured to store the traditional teacher estimation model learned by a paralinguistic information estimation model learning device, wherein the paralinguistic information estimation model learning device further comprises: a traditional teacher determiner configured to determine, based on the paralinguistic information label, a traditional teacher label indicating a traditional teacher serving as correct paralinguistic information for the utterance; and a traditional teacher estimation model learner configured to learn, based on an acoustic feature extracted from the utterance and the traditional teacher label, a traditional teacher estimation model for outputting a posterior probability of traditional teacher for an input acoustic feature, and wherein the paralinguistic information estimator estimates the paralinguistic information of the input utterance based on a weight difference between the posterior probability of anti-teacher obtained by inputting the acoustic feature into the anti-teacher estimation model and the posterior probability of traditional teacher obtained by inputting the acoustic feature into the traditional teacher estimation model.
 7. (canceled)
 8. (canceled)
 9. The paralinguistic information estimation model learning device according to claim 1, wherein the paralinguistic information includes one or more of emotions being joy, sadness, anger, or calm from the utterance.
 10. The paralinguistic information estimation model learning device according to claim 1, wherein the anti-teacher includes a piece of paralinguistic information determined by a group of listeners, and wherein a number of listeners in the group of listeners is less than a threshold as compared to a number of the plurality of listeners.
 11. The paralinguistic information estimation model learning device according to claim 3, wherein the paralinguistic information includes one or more of emotions being joy, sadness, anger, or calm from the utterance.
 12. The paralinguistic information estimation model learning device according to claim 3, wherein the anti-teacher includes a piece of paralinguistic information determined by a group of listeners, and wherein a number of listeners in the group of listeners is less than a threshold as compared to a number of the plurality of listeners.
 13. The paralinguistic information estimation device according to claim 5, wherein the paralinguistic information includes one or more of emotions being joy, sadness, anger, or calm from the utterance.
 14. The paralinguistic information estimation device according to claim 5, wherein the anti-teacher includes a piece of paralinguistic information determined by a group of listeners, and wherein a number of listeners in the group of listeners is less than a threshold as compared to a number of the plurality of listeners. 